Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a pre-commit script to detect missing i18n implementations #9428

Merged

Conversation

pidgezero-one
Copy link
Contributor

@pidgezero-one pidgezero-one commented Jun 13, 2024

Closes #9423

  • Adds a pre-commit script that finds HTML that might be missing i18n implementation

Technical

Parses the contents of html files in specific directories to detect the line and column numbers for tags that might be missing i18n implementation.

Mostly a series of loops to ensure that the provided indexes are correct.

Testing

I've been adding junk data to html files and running both pre-commit run and pre-commit run --all-files, and it looks like the scope of those two different commands is correct. However, the regex approach returns quite a lot of what I think are false positives.

Screenshot

Stakeholders

@rebecca-shoptaw

@pidgezero-one pidgezero-one changed the title 9423/feat/detect missing i18n WIP: Add a pre-commit script to detect missing i18n implementations Jun 13, 2024
@pidgezero-one
Copy link
Contributor Author

pidgezero-one commented Jun 13, 2024

@rebecca-shoptaw @cdrini

So far this appears to be working as expected according to the issue description (except I'm not sure how to get it to only run against PR files instead of all files in the action runner), however my regex approach flags quite a lot of what I think are false positives (example: https://results.pre-commit.ci/run/github/69609/1718241262.NJ_4UJGwTqOunVrs7ivFkA)

Some examples:

  • I think this should ignore <code> elements.
  • Should it be considered an error when the contents of an element are a template variable? Example: ERRO openlibrary/macros/SearchResults.html:23:36 <a href="$book.key/$title.replace(' ', '-')">$book.title_prefix $title</a></span> or ERRO openlibrary/templates/covers/book_cover_single_edition.html:18:16 <div class="BookTitle">$:macros.TruncateString(title, 70)
  • I'm not sure what the meaning is of \$$, for example: ERRO openlibrary/templates/admin/sponsorship.html:69:6 <td>\$$('{:,.2f}'.format(summary['avg_scan_cost'] / 100.))</td> - is this something else that should avoid flagging?
  • There are some instances where the same error is flagged twice when an outer element is on the same line as an inner element and they're separated by a superfluous space:
ERRO openlibrary/templates/subjects.html:79:12  <span class="count">        <a href="/search?${page.subject_type}_facet=$page.name.replace('&','%26')">Search within $page.name</a></span>
ERRO openlibrary/templates/subjects.html:79:40  <a href="/search?${page.subject_type}_facet=$page.name.replace('&','%26')">Search within $page.name</a></span>

(I am assuming that this is an indication that the space should be deleted and not that the script should account for this?)

  • I don't have a good answer for tags that are valid inside i18n functions, like <strong> or <em> or maybe <span>. I thought about ignoring any result that matches those three tags, but then that would miss things like <div><span>untranslated text</span></div>. I've also considered stripping those tags from the text before processing it, but that would cause incorrect position indexes in the error messages.

Please let me know what your thoughts are on the above and I'll be happy to make any changes!

@rebecca-shoptaw
Copy link
Collaborator

@pidgezero-one This is looking great!!
To respond in order:

  • I think this should ignore <code> elements.
  • +1 re: ignoring <code> divs, that's a good catch
  • Should it be considered an error when the contents of an element are a template variable? Example: ERRO openlibrary/macros/SearchResults.html:23:36 <a href="$book.key/$title.replace(' ', '-')">$book.title_prefix $title</a></span> or ERRO openlibrary/templates/covers/book_cover_single_edition.html:18:16 <div class="BookTitle">$:macros.TruncateString(title, 70)
  • I agree that $-escaped template variables should be ignored, ideally a $ that is not followed by a ( (the syntax intended to trigger WARN) should be fine
  • I'm not sure what the meaning is of \$$, for example: ERRO openlibrary/templates/admin/sponsorship.html:69:6 <td>\$$('{:,.2f}'.format(summary['avg_scan_cost'] / 100.))</td> - is this something else that should avoid flagging?
  • I believe that's an escaped $ associated with a dollar amount (scan cost), a probably very unusual edge case, but we should avoid flagging it

(I am assuming that this is an indication that the space should be deleted and not that the script should account for this?)

  • +1, I think you can go ahead and delete the space! I imagine this script is going to catch a number of random little errors like that which we can just fix
  • I don't have a good answer for tags that are valid inside i18n functions, like <strong> or <em> or maybe <span>. I thought about ignoring any result that matches those three tags, but then that would miss things like <div><span>untranslated text</span></div>. I've also considered stripping those tags from the text before processing it, but that would cause incorrect position indexes in the error messages.
  • @cdrini has a good solution for this which he can describe in more detail! I believe that the idea is that once you've found a correctly translated string, you check whether any untranslated strings are inside it and ignore those? Or something along those lines, he can go through it fully.

And re: how to just run on changed files, I believe pre-commit will automatically just pass changed files to the function, i.e. i18n_checker.py foo1.html foo2.html if running following an HTML commit. Again CC @cdrini for how exactly the process works, pre-commit itself should be doing most of the heavy lifting for you!

@pidgezero-one
Copy link
Contributor Author

Thank you @rebecca-shoptaw ! I've added some changes that will handle the first four points and will need to think a bit more about how to get the fifth to work.

When I run pre-commit run locally, it only runs against staged files, but in the CI pipeline it looks like it's running for all files. I'm not too familiar with the CI pipeline yet so I'm not sure how to fix that!

@rebecca-shoptaw
Copy link
Collaborator

rebecca-shoptaw commented Jun 14, 2024

@pidgezero-one Great, thank you! Definitely worth waiting for Drini's input on number 5, he has a solid method he thought through for it some time ago and can easily pass along. Very glad it's running against staged files locally, we can investigate re: the CI.

@cdrini
Copy link
Collaborator

cdrini commented Jun 14, 2024

That all looks good! For 5, I'd say if the line/char with the > prefix error has a $: somewhere on the line before the error, then we can skip it. Since anything that contains html will need a $: before it (e.g. $:_('Stay <strong>strong</strong>'). That should let us avoid these with very few false positives!

@pidgezero-one
Copy link
Contributor Author

Thank you for the guidance @rebecca-shoptaw and @cdrini! 😊 I believe the next challenge is that it flags elements which contain punctuation and no words - do we want to go as far as to treat those as errors (or even warnable if encapsulated in $())?

@cdrini
Copy link
Collaborator

cdrini commented Jun 14, 2024

Wow the results for this are looking so great!! This is going to be so useful 😊

Hmmm can you give some examples? I saw a few with &times; and &nbsp; and &larr;. Maybe we can erase these if they're the thing that immediately follow the failing >? Something like replace (&times;|&nbsp;|&larr;)+\s* after the failing > with '', and see if it still fails? Would that be too difficult?

@cdrini
Copy link
Collaborator

cdrini commented Jun 14, 2024

Oh also can you try running the script with time? I'm curious to see how long it takes!

@pidgezero-one
Copy link
Contributor Author

pidgezero-one commented Jun 15, 2024

@cdrini So I think this might actually be a trivial problem to solve with the \p{L} operator, but that would require importing the pip regex library (the standard re lib does not support it). Do you think it'd be worth introducing a new dependency?

I think that access to that operator could be useful to have in general (I've used it in other projects for regexes that needed to look for non-romance words, for example) but wanted to get your (and @rebecca-shoptaw )'s opinions before doing anything drastic.

And when you say running the script with time, is that a pre-commit argument or do you mean literally recording timestamps within the script?

@cdrini
Copy link
Collaborator

cdrini commented Jun 15, 2024

Hmm, how would \p{L} help in this case?

entry: python ./scripts/detect_missing_i18n.py
types: [html]
language: python
verbose: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this so that SKIP lines will still show even when the exit code is 0. Pre-commit will hide the script output if the exit code is 0.

@pidgezero-one pidgezero-one marked this pull request as ready for review June 24, 2024 12:51
@pidgezero-one
Copy link
Contributor Author

pidgezero-one commented Jun 24, 2024

TIL about require_serial in pre-commit! The first problem is fixed now (and I can also see now that the total time to run this script is 0.20!!!), but the discrepancies between CI and local behaviour are still an issue.

@pidgezero-one pidgezero-one requested a review from cdrini June 24, 2024 12:57
@cdrini cdrini changed the title WIP: Add a pre-commit script to detect missing i18n implementations Add a pre-commit script to detect missing i18n implementations Jun 25, 2024
cdrini added 3 commits June 25, 2024 12:35
We only had two files outside of the valid dirs, and it makes this CLI a
little cleaner since every specified file will now be processed instead
of some silently skipped.
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok looking good! Ah this is failing locally, the text is just hidden since there's a lot of output. The bug is related to when the errcount is incremented! Should be super close know 😊

Run pre-commit run detect-missing-i18n --all-files then echo $? to see the status code.

I'm going to be offline for the next spell but this is the last bug then we should be good to go! Would you mind taking over for me on this one @rebecca-shoptaw ?

scripts/detect_missing_i18n.py Outdated Show resolved Hide resolved
elif includes_error_attribute:
char_index = includes_error_attribute.start()
errtype = Errtype.ERR
errcount += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is where the bug is! The errors are being counted even when we continue down below. We should only +1 at the end after we run all the continues.

@pidgezero-one
Copy link
Contributor Author

Run pre-commit run detect-missing-i18n --all-files then echo $? to see the status code.

I think there's still a discrepancy - I'm also referring to the case where every file is skipped. Up until my most recent changes, pre-commit run detect-missing-i18n --all-files skips every file and shows an error code of 0 locally using echo $?, but skipping every file in the CI exits with code 1.

@cdrini cdrini assigned rebecca-shoptaw and unassigned cdrini Jun 25, 2024
@pidgezero-one
Copy link
Contributor Author

Figured it out - the CI is running the script against files I hadn't yet pulled locally and detecting errors in them. I've added all outstanding files to the exclude list so we can start tackling them. All checks pass now!

@pidgezero-one pidgezero-one requested a review from cdrini June 25, 2024 13:36
Copy link
Collaborator

@cdrini cdrini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Niiiice good thinking! This code lgtm! 😊 @rebecca-shoptaw if you could give it a quick sanity check QA, and then merge, that would be awesome! (I'm afk for a spell 😁)

Copy link
Collaborator

@rebecca-shoptaw rebecca-shoptaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pidgezero-one Great work on this!!
QA:
Tried running pre-commit run --all-files with no i18n formatting changes:
✅ Detect missing i18n check passes
❓ Output for all skipped files appears, but if this only happens for --all-files which seems to be the case, that's fine with me and could be useful
❓It would be awesome if the skipped files comment was even clearer re: the process to fix the problem(s), i.e. "remove the file from exclude list then run the script again to see what the issue is"

Tried running pre-commit with a test commit after making an i18n formatting error in a non-skipped file:
✅ Successfully got an error message when un-i18n-syntaxing both basic text and complex edge cases
✅ Did not see skipped files output

Tried running pre-commit with a test commit after adding new non-i18n-syntaxed text:
✅ Successfully got an error message
✅ Error message disappeared and commit succeeded when I fixed the syntax

Looks great to me! I'm happy to approve a merge, and I'm thinking that once it's merged I'll assign myself a deep dive into those skipped files to try to fix as many as I can, and can bundle a minor instructions clarification into that. 🙂

@rebecca-shoptaw rebecca-shoptaw added the Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. label Jun 25, 2024
@pidgezero-one
Copy link
Contributor Author

pidgezero-one commented Jun 25, 2024

Thank you @rebecca-shoptaw ! 😊

❓ Output for all skipped files appears, but if this only happens for --all-files which seems to be the case, that's fine with me and could be useful

Yep, this is the case! Regardless of if a file is skipped or errored, it'll only show if that file was passed to the script, so without --all-files it would only apply to staged files.

❓It would be awesome if the skipped files comment was even clearer re: the process to fix the problem(s), i.e. "remove the file from exclude list then run the script again to see what the issue is"

By "skipped files comment" what does this refer to? A comment in the code, or the script's output, or the instructions I put in the wiki?

I'll be happy to address that ASAP so it can be merged - unfortunately the longer this PR remains open, the more updates I'll have to make to it as other PRs are merged with HTML changes that will set off the CI 😵‍💫

@rebecca-shoptaw
Copy link
Collaborator

rebecca-shoptaw commented Jun 25, 2024

@pidgezero-one No need to make any more changes! I just meant the comment in the code itself (added with "explain exclude list"), and I'll tweak it if necessary when I do my new/related PR to fix up those excluded files 🙂

I've approved the changes, we just need a staff member with merge powers to go ahead and merge

@pidgezero-one
Copy link
Contributor Author

@rebecca-shoptaw Oh okay cool! Thank you so much! 😄

@cdrini
Copy link
Collaborator

cdrini commented Jun 25, 2024

Woohoo!! Thank you so much folks!!

@cdrini cdrini merged commit c546c6a into internetarchive:master Jun 25, 2024
4 checks passed
@rebecca-shoptaw rebecca-shoptaw removed the Needs: Staff / Internal Reviewed a PR but don't have merge powers? Use this. label Jun 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create python script to automatically detect potentially missing i18n strings
3 participants